SKDBERT: Compressing BERT via Stochastic Knowledge Distillation
Authors
Abstract
In this paper, we propose Stochastic Knowledge Distillation (SKD) to obtain a compact BERT-style language model dubbed SKDBERT. In each distillation iteration, SKD samples a teacher model from a pre-defined teacher team, which consists of multiple teacher models with multi-level capacities, to transfer knowledge into the student model in a one-to-one manner. The sampling distribution plays an important role in SKD. We heuristically present three types of sampling distributions to assign appropriate probabilities to the multi-level teacher models. SKD has two advantages: 1) it can preserve the diversities of multi-level teacher models via stochastically sampling a single teacher model in each iteration, and 2) it can also improve the efficacy of knowledge distillation when a large capacity gap exists between the teacher model and the student model. Experimental results on the GLUE benchmark show that SKDBERT reduces the size of a BERT model by 40% while retaining 99.5% of its language understanding performance and being 100% faster.
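The abstract describes the SKD procedure only at a high level. The following is a minimal PyTorch sketch of the idea: each training iteration samples one teacher from a multi-level teacher team according to a sampling distribution and distills it one-to-one into the student. The toy classifiers standing in for BERT-style models, the uniform sampling distribution, the temperature, and the loss weighting alpha are all illustrative assumptions, not the paper's actual configuration.

# Minimal sketch of Stochastic Knowledge Distillation (SKD); assumptions noted above.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

num_classes, feat_dim, temperature, alpha = 4, 32, 2.0, 0.5

# Teacher team with multi-level capacities (hidden widths are placeholders).
teacher_team = [
    nn.Sequential(nn.Linear(feat_dim, width), nn.ReLU(), nn.Linear(width, num_classes))
    for width in (64, 128, 256)
]
student = nn.Sequential(nn.Linear(feat_dim, 16), nn.ReLU(), nn.Linear(16, num_classes))

# Sampling distribution over the teacher team; the paper studies several
# heuristic distributions, a uniform one is assumed here for illustration.
probs = torch.full((len(teacher_team),), 1.0 / len(teacher_team))

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(100):
    x = torch.randn(8, feat_dim)               # toy batch
    y = torch.randint(0, num_classes, (8,))    # toy labels

    # Stochastically pick a single teacher for this iteration.
    t_idx = torch.multinomial(probs, num_samples=1).item()
    with torch.no_grad():
        teacher_logits = teacher_team[t_idx](x)

    student_logits = student(x)

    # One-to-one distillation: KL between softened distributions plus task loss.
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    ce_loss = F.cross_entropy(student_logits, y)
    loss = alpha * kd_loss + (1.0 - alpha) * ce_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

The sampling distribution (probs above) is the component the abstract highlights: replacing the uniform choice with one of the paper's heuristic distributions changes how often higher-capacity teachers are drawn in each iteration.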
Similar Resources
Sequence-Level Knowledge Distillation
Neural machine translation (NMT) offers a novel alternative formulation of translation that is potentially simpler than statistical approaches. However to reach competitive performance, NMT models need to be exceedingly large. In this paper we consider applying knowledge distillation approaches (Bucila et al., 2006; Hinton et al., 2015) that have proven successful for reducing the size of neura...
Topic Distillation with Knowledge Agents
This is the second year that our group participates in TREC’s Web track. Our experiments focused on the Topic distillation task. Our main goal was to experiment with the Knowledge Agent (KA) technology [1], previously developed at our Lab, for this particular task. The knowledge agent approach was designed to enhance Web search results by utilizing domain knowledge. We first describe the generi...
Compressing Rank-Structured Matrices via Randomized Sampling
Randomized sampling has recently been proven a highly efficient technique for computing approximate factorizations of matrices that have low numerical rank. This paper describes an extension of such techniques to a wider class of matrices that are not themselves rank-deficient but have off-diagonal blocks that are—specifically, the classes of so-called hierarchically off-diagonal low rank (HODL...
Knowledge Distillation for Bilingual Dictionary Induction
Leveraging zero-shot learning to learn mapping functions between vector spaces of different languages is a promising approach to bilingual dictionary induction. However, methods using this approach have not yet achieved high accuracy on the task. In this paper, we propose a bridging approach, where our main contribution is a knowledge distillation training objective. As teachers, rich resource ...
WebChild 2.0: Fine-Grained Commonsense Knowledge Distillation
Despite important progress in the area of intelligent systems, most such systems still lack commonsense knowledge that appears crucial for enabling smarter, more human-like decisions. In this paper, we present a system based on a series of algorithms to distill fine-grained disambiguated commonsense knowledge from massive amounts of text. Our WebChild 2.0 knowledge base is one of the largest co...
Journal
Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence
Year: 2023
ISSN: 2159-5399, 2374-3468
DOI: https://doi.org/10.1609/aaai.v37i6.25902